Abstract


This  paper  presents  a  Learning-based  Nonlinear  Model  Predictive  Control  (LB-NMPC) 
algorithm  to  achieve  high-performance  path  tracking  in  challenging  off-road  terrain  through 
learning.  The  LB-NMPC  algorithm  uses  a  simple  a  priori  vehicle  model  and  a  learned 
disturbance  model.  Disturbances  are  modelled  as  a  Gaussian  Process  (GP)  as  a  function  of 
system  state,  input,  and  other  relevant  variables.  The  GP  is  updated  based  on  experience 
collected  during  previous  trials.  Localization  for  the  controller  is  provided  by  an  on-board, 
vision-based  mapping  and  navigation  system  enabling  operation  in  large-scale,  GPS-denied 
environments.  The  paper  presents  experimental  results  including  over  3  km  of  travel  by 
three  significantly  different  robot  platforms  with  masses  ranging  from  50  kg  to  600  kg  and 
at speeds ranging from 0.35 m/s to 1.2 m/s (see footnote 1). Planned speeds are generated by a novel
experience-based  speed  scheduler  that  balances  overall  travel  time,  path-tracking  errors, 
and  localization  reliability.  The  results  show  that  the  controller  can  start  from  a  generic 
a  priori  vehicle  model  and  subsequently  learn  to  reduce  vehicle-  and  trajectory-specific 
path-tracking  errors  based  on  experience. 

1  Associated  video  at  http://tiny.cc/RoverLearnsDisturbances 
Robot             Mass     Size (L x W)     Steering
DMRV              600 kg   2 m x 1.5 m      Ackermann
ROC6              150 kg   1.5 m x 0.5 m    Skid
Clearpath Husky   50 kg    0.9 m x 0.6 m    Skid


Figure 1: Robots used to demonstrate the effectiveness of the learning controller. Despite significant differences
in robot mass, wheel base, kinematics, and actuator designs, the algorithm uses the same nominal
model  for  all  three  robots  and  learns  disturbances  over  trials  in  order  to  accurately  track  desired  paths. 


1  Introduction 


It  is  well  recognized  that  autonomous  guidance,  navigation,  and  control  of  mobile  robots  in  unstructured, 
off-road  terrain  is  one  of  the  highest  goals  in  field  robotics.  For  example,  robots  capable  of  autonomous 
off-road operation would be useful in law enforcement, disaster search and rescue, military, forestry, and
mining  applications.  However,  operation  in  off-road  terrain  requires  advanced  control  techniques  to  mitigate 
the  effects  of  unmodelled  surface  materials  (e.g.,  snow,  sand,  grass),  terrain  topography  (e.g.,  side-slopes, 
inclines),  and  complex  robot  dynamics  (Figure  1).  Finding  representative  a  priori  models  for  such  effects  is 
challenging  since  (i)  the  terrain  is  often  not  known  ahead  of  time,  (ii)  robot-terrain  interaction  models  often 
do  not  exist,  and  (iii)  even  if  such  models  did  exist,  finding  model  parameters  is  cumbersome.  In  the  last 
decade,  there  has  been  significant  work  on  learning-based  controllers  for  robotics  (Schaal  and  Atkeson,  2010; 
Nguyen-Tuong  and  Peters,  2011).  Learning-based  algorithms  alleviate  the  need  for  significant  engineering 
work  to  identify  and  model  all  disturbances  that  a  model-based  controller  may  be  required  to  mitigate. 

In  this  paper,  we  investigate  a  Learning-Based  Nonlinear  Model  Predictive  Control  (LB-NMPC)  algorithm 
for a path-repeating mobile robot operating in challenging outdoor terrain. The algorithm uses a fixed,
simple robot model and a learned, non-parametric disturbance model. The goal is to reduce path-tracking
errors  using  real-world  experience  and  a  disturbance  model  instead  of  pre-programming  accurate  analytical 
models  that  are  generally  difficult  to  derive.  Disturbances  represent  measured  discrepancies  between  the  a 
priori  model  and  the  observed  system  behavior.  They  are  modelled  as  a  Gaussian  Process  (GP)  based  on 
previous  experience  as  a  function  of  state,  input,  and  other  relevant  variables.  Modelling  the  disturbances  as 
a  GP  enables  the  algorithm  to  learn  complex  nonlinear  model  discrepancies  and  generalize  to  novel  situations. 

We  also  investigate  a  novel  experience-based  speed  scheduler.  During  the  first  trial,  when  the  controller  is 
solely  based  on  the  a  priori  model,  the  speed  is  set  such  that  path  tracking  is  achieved  with  tolerable  errors 
and  reliable  vision-based  localization.  After  this  first  pass,  the  scheduler  adjusts  the  planned  speed  based  on 
previous experience (i.e., tracking errors and localization quality). The new speed schedule achieves faster
overall  path  completion  while  guaranteeing  low  path-tracking  errors  and  reliable  localization.  In  this  system, 
the  LB-NMPC  algorithm  is  also  shown  to  interpolate  and  extrapolate  from  experience.  Preliminary  results 
for  the  scheduler  operating  with  a  fixed  feedback  controller  have  been  published  in  Ostafew  et  al.  (2014a). 

Localization  for  the  controller  is  provided  by  an  on-board,  Visual  Teach  &  Repeat  (VT&R)  mapping  and 
navigation  algorithm  enabling  operation  in  large-scale,  GPS-denied  environments  (Furgale  and  Barfoot,  2010; 
Stenning  et  al.,  2013).  In  the  first  operational  phase,  the  teach  phase,  the  robot  is  piloted  along  the  desired 
path.  Localization  in  this  phase  is  obtained  relative  to  the  robot’s  starting  position  by  Visual  Odometry  (VO), 
computing  pose  changes  over  sequential  images  based  on  extracted  feature  maps  (3D  positions  and  associated 
descriptors).  Then  at  discrete  points  along  the  desired  path,  the  algorithm  stores  the  currently  viewed  feature 
map.  During  the  repeat  phase,  the  algorithm  re-localizes  against  stored  feature  maps  given  the  current  robot 
view,  thus  generating  feedback  for  a  path-tracking  controller  (Figure  2).  As  long  as  a  sufficient  number  of 
feature  matches  are  made  between  the  live  robot  view  and  the  stored  feature  maps,  the  system  generates 
consistent  localization  over  trials  and  is  able  to  support  a  learning  control  algorithm. 

The  key  contributions  of  this  paper  are:  (i)  a  path-tracking,  LB-NMPC  algorithm  based  on  a  simple  a 
priori  process  model  and  learned  disturbance  model,  (ii)  an  experience-based  speed  scheduler  balancing 
overall  travel  time,  path-tracking  errors,  and  vision-based  localization  reliability,  and  (iii)  extensive  outdoor 
experiments  on  three  different  robot  platforms  ranging  from  50  kg  to  600  kg  with  both  skid  and  Ackermann 
steering  (Figure  1).  This  paper  is  an  extension  of  previous  work  (Ostafew  et  al.,  2014b).  Significant  additions 
include  a  detailed  description  and  discussion  of  the  LB-NMPC  algorithm,  an  illustrative  example,  and  an 
additional  experiment  showing  successful  learning  on  an  Ackermann-steered  robot;  all  previous  test  robots 
were skid-steered. The structure of this paper is as follows. Section 2 relates our work to other research in
this field. Sections 3 and 4 describe the proposed LB-NMPC algorithm and implementation details for our
experiments, respectively. Section 3.4 presents simulation results, giving an intuition of the benefits of the
LB-NMPC algorithm, while Section 5 presents experimental results, demonstrating the successful operation
of the algorithm in practice. Finally, Sections 6 and 7 present a discussion and conclusion.

Figure 2: Visual representations of re-localization by our Visual Teach & Repeat (VT&R) navigation algorithm
with high (left image) and low (right image) path-tracking errors. Each feature track represents
the translation between a feature identified during the VT&R teach phase and the current repeat phase.
Reducing the path-tracking errors leads to improved reliability of our VT&R algorithm since it is sensitive
to perspective changes.


2  Related  Work 

In  this  section,  we  present  related  work  aimed  at  achieving  high-performance  path-tracking  in  spite  of 
unknown  disturbances.  First  we  provide  a  brief  review  of  approaches  involving  model-based  control  with  (in 
some  cases)  online  parameter  identification.  Then  we  provide  a  background  on  learning  control  approaches. 

2.1  Model-based  and  Adaptive  Controllers 

In  our  testing,  the  two  largest  sources  of  disturbances  were  the  unmodelled  dynamics  of  the  robot  (including 
actuators), and the wheel-terrain interactions. One method for mitigating the effects of unknown wheel-terrain
interaction is to design a robot with all-wheel drive and steering such that lateral and angular vehicle
slip can be compensated directly. For example, Ishigami et al. (2009) estimate the vehicle slip angle and path-tracking
errors using visual and wheel odometry. Then, using two separate proportional feedback controllers,
they  command  the  front  wheels  so  as  to  reduce  path-tracking  errors,  and  the  rear  wheels  so  as  to  compensate 
for  the  vehicle  slip  angle.  Similarly,  Helmick  et  al.  (2006,  2009)  and  Angelova  et  al.  (2007)  estimate  lateral 
and angular vehicle slip rates using visual and wheel odometry. Then they use proportional feedback control
to generate desired lateral and angular velocities to compensate for vehicle slip rates. Finally, they use
the  robot’s  inverse  dynamics  to  generate  desired  individual  wheel  speeds  and  orientations.  However,  these 
approaches  can  only  react  to  path-tracking  errors  and  vehicle  slip.  On  the  other  hand,  our  approach  is 
based  on  Model  Predictive  Control,  including  a  learned  model  representing  wheel-terrain  interactions,  robot 
dynamics,  and  other  systematic  disturbances,  and  can  therefore  act  in  anticipation  of  tracking  errors. 

Cariou  et  al.  (2009)  and  Guillet  et  al.  (2013)  propose  online  adaptive  controllers  mitigating  wheel  slip  and 
robot  dynamics.  They  demonstrate  feedback-linearized  controllers  based  on  kinematic  models  extended  with 
wheel  slip  angles.  The  slip  angles  are  estimated  online  using  observers.  They  address  robot  dynamics  using 
a  predictive  controller  including  future  path  curvatures  and  offline  tuned  values  representing  actuator  delay 
and robot inertias. Unlike these controllers, which partially identify and model disturbances a priori, our
learning approach treats disturbances, including both wheel slip and robot dynamics, as a GP, enabling the
representation of complex disturbance characteristics not known prior to operation. Furthermore, since disturbances
are learned and stored in memory, our algorithm can anticipate disturbances and act accordingly.

Model  Predictive  Control  (MPC)  is  a  control  framework  that  uses  a  process  model  directly.  The  current 
control  action  is  obtained  by  solving,  at  each  sampling  instant,  a  finite-horizon  optimal  control  problem 
using the current state of the plant as the initial state (Rawlings and Mayne, 2009). Kühne et al. (2005),
Klančar and Škrjanc (2007), and Xie and Fierro (2008) present MPC-based mobile robot controllers based
on  kinematic  models  and  show  results  for  robots  traveling  on  smooth,  flat  surfaces.  Howard  et  al.  (2009) 
demonstrate  MPC  on  a  large-scale,  outdoor  robot  navigating  intricate  paths.  Finally,  Peters  and  Iagnemma 
(2008) demonstrate MPC for a mobile robot where the process model includes effects such as tire deformation,
wheel-terrain interaction, and suspension compliance. However, in each of these examples, the
controllers  are  based  on  a  priori  models  and,  in  some  cases,  rely  on  parameters  whose  determination  in 
practice  is  challenging.  For  example,  Iagnemma  et  al.  (2004)  demonstrate  an  online  method  of  estimating 
terramechanics parameters where the rover speed is restricted to 10 cm/s in order to assume a quasi-static
analysis.  At  higher  speeds,  Seegmiller  et  al.  (2013)  determine  vehicle  model  parameters,  including  wheel  slip, 
by  integrated  prediction  error  minimization.  In  this  paper,  our  NMPC  algorithm  is  based  on  a  fixed  nominal 
model  and  a  learned,  non-parametric  disturbance  model.  This  reduces  the  need  for  accurate  a  priori  process 
models  and  parameter-specific  observers  while  maintaining  the  benefits  of  MPC  such  as  predictive  behavior 
and  constraint  handling. 


2.2  Learning  Controllers 


Unlike controllers based on fixed models, controllers using learned models gather data over time, incrementally
constructing accurate approximations of the true system model. In this paper, we model disturbances
as  a  GP  based  on  input-output  data  from  previous  trials.  This  approach  enables  both  model  flexibility  and 
consistent  uncertainty  estimates  (Rasmussen,  2006).  For  example,  Kocijan  et  al.  (2004)  combine  a  GP  model 
and  MPC  for  the  control  of  a  simulated  pH  neutralization  process.  They  represent  the  full  dynamics  of  the 
system  by  a  GP  model  trained  on  400  observations  of  the  chemical  system.  MPC  is  applied  to  control  the 
system based on the offline-identified GP model (i.e., no online learning). While their work was restricted
to  offline  simulation,  our  algorithm  is  used  for  real-time  path-tracking  and  learns  from  trial  to  trial.  Sparse 
GP  approximations  are  one  approach  to  enable  fast,  online  GP  evaluation,  and  do  so  by  discarding  some 
training points and keeping only ‘inducing inputs’, also known as ‘support points’ (Quiñonero-Candela and
Rasmussen, 2005). Alternatives are Local GP (LGP) methods, as implemented in this work, which enable
online  operation  by  dividing  the  GP  input  space  into  smaller  subspaces  and  generating  an  LGP  for  each 
subspace  (Rasmussen  and  Ghahramani,  2002;  Snelson  and  Ghahramani,  2007).  For  example,  Nguyen-Tuong 
et  al.  (2009)  and  Meier  et  al.  (2014)  focus  on  achieving  online  operation  and  use  LGP  models  to  approximate 
the  inverse  dynamics  of  7-DoF  manipulator  arms.  Unlike  these  two  examples  where  many  LGP  models 
are  generated  for  operation,  we  rapidly  compute  a  single  LGP  model  online  based  on  a  sliding  window  of 
learned  data  and  use  NMPC  to  enable  predictive  control.  Finally,  robustness  of  learning  controllers  is  a 
large  unanswered  question.  Aswani  et  al.  (2013)  focus  on  developing  a  safe  and  robust  LB-MPC  approach 
using  Tube  MPC  (Langson  et  al.,  2004).  The  approach  produces  optimal  inputs  based  on  the  learned  system 
dynamics.  However,  they  ensure  safety  and  robustness  by  checking  whether  these  inputs  keep  the  nominal 
model  stable  when  it  is  subject  to  uncertainty.  In  this  paper,  we  do  not  explicitly  consider  the  robustness  of 
the  controller  but  focus  on  the  practical  application  of  LB-NMPC  to  mobile  robots.  This  requires  continuous 
operation  from  the  first  trial  and  representation  of  complex  disturbances  by  the  learned  model. 

Additionally, Iterative Learning Control (ILC) and Reinforcement Learning (RL) are two other common approaches
to learning from experience. Schoellig et al. (2012) and Ostafew et al. (2013) present ILC algorithms
for  quadrotors  and  mobile  robots,  respectively,  that  learn  a  feedforward  control  signal  over  sequential  trials. 
Unlike  ILC,  our  LB-NMPC  algorithm  learns  a  flexible,  general  disturbance  model  that  allows  interpolation 
and  extrapolation  of  learned  experience,  and  thus  covers  multiple  paths  and  speed  schedules  simultaneously. 
RL,  on  the  other  hand,  learns  a  control  policy  that  maximizes  a  cumulative  expected  reward.  For  example, 
Abbeel et al. (2006) and Ko et al. (2007) present RL algorithms for the control of a mobile robot and an
autonomous blimp, respectively. However, unlike our algorithm, which provides continuous operation from
the first trial, RL is known to require a prohibitively large number of training examples before operation, an
issue for RL that is the focus of much current work (Deisenroth et al., 2014).

Figure 3: The LB-NMPC algorithm is composed of two parts: 1) the path-tracking NMPC algorithm that
includes a nominal process model, and 2) the GP-based Disturbance Model. During the first trial, the
algorithm relies solely on the nominal process model to guide the vehicle along the desired path, $\mathbf{z}_d$. In
subsequent trials, the NMPC algorithm uses the disturbance model as a correction to the nominal model at
states, $\mathbf{a}$, to be defined in Section 3.1. Dashed lines indicate that the signals $\mathbf{z}_k$ and $\mathbf{u}_k$ update the model.


3  Mathematical  Formulation 

3.1  Nonlinear  Model  Predictive  Control 

At  a  given  sample  time,  the  NMPC  algorithm  finds  a  sequence  of  control  inputs  that  optimizes  the  plant 
behaviour  over  a  prediction  horizon  based  on  the  current  state.  The  first  input  in  the  optimal  sequence 
is  then  applied  to  the  system,  resulting  in  a  new  system  state.  The  entire  process  is  repeated  at  the  next 
sample  time  for  the  new  system  state.  In  traditional  NMPC  implementations  (Rawlings  and  Mayne,  2009), 
the  process  model  is  specified  a  priori  and  remains  unchanged  during  operation.  In  this  paper,  we  augment 
the  process  model  with  a  disturbance  model  generated  from  experience  in  order  to  compensate  for  effects  not 
captured  by  the  fixed  process  model,  such  as  environmental  disturbances  and  unknown  dynamics  (Figure  3). 

3.1.1  Full-State  Feedback  Control 

Consider  the  following  nonlinear,  state-space  system: 


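$$\mathbf{z}_{k+1} = \mathbf{f}_{\text{true}}(\mathbf{z}_k, \mathbf{u}_k), \tag{1}$$

with observable system state, $\mathbf{z}_k \in \mathbb{R}^n$, and control input, $\mathbf{u}_k \in \mathbb{R}^m$, both at time $k$. In this work, the true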
system  is  not  known  exactly  and  is  represented  by  the  sum  of  an  a  priori  model  and  an  experience-based, 
learned  model, 

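$$\mathbf{z}_{k+1} = \underbrace{\mathbf{f}(\mathbf{z}_k, \mathbf{u}_k)}_{\text{a priori model}} + \underbrace{\mathbf{g}(\mathbf{z}_k, \mathbf{u}_k)}_{\text{learned disturbance model}}. \tag{2}$$

The models $\mathbf{f}(\cdot)$ and $\mathbf{g}(\cdot)$ are nonlinear process models: $\mathbf{f}(\cdot)$ is a known nominal process model representing
our knowledge of $\mathbf{f}_{\text{true}}(\cdot)$; $\mathbf{g}(\cdot)$ is an (initially unknown) disturbance model representing discrepancies between
the nominal model and the actual system behavior. The system is further assumed to be Markovian, thus
the processes $\mathbf{f}(\cdot)$ and $\mathbf{g}(\cdot)$ involve only states from the current time.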

As  previously  mentioned,  the  objective  of  the  NMPC  algorithm  is  to  find  a  set  of  controls  that  optimizes  the 
plant  behaviour  over  a  given  prediction  horizon.  To  this  end,  we  define  the  cost  function  to  be  minimized 
over  the  next  K  time-steps  to  be 


$$J(\mathbf{u}) = (\mathbf{z}_d - \mathbf{z})^T \mathbf{Q} (\mathbf{z}_d - \mathbf{z}) + \mathbf{u}^T \mathbf{R} \mathbf{u}, \tag{3}$$

where $\mathbf{Q} \in \mathbb{R}^{Kn \times Kn}$ is positive semi-definite, $\mathbf{R} \in \mathbb{R}^{Km \times Km}$ is positive definite, $\mathbf{u}$ is a sequence of control
inputs, $\mathbf{u} = (\mathbf{u}_k, \dots, \mathbf{u}_{k+K-1})$, $\mathbf{z}_d$ is a sequence of desired states, $\mathbf{z}_d = (\mathbf{z}_{d,k+1}, \dots, \mathbf{z}_{d,k+K})$, $\mathbf{z}$ is a sequence
of predicted states, $\mathbf{z} = (\mathbf{z}_{k+1}, \dots, \mathbf{z}_{k+K})$, obtained from (2) when applying $\mathbf{u}$, and $K$ is a given prediction
horizon length. Weighting on the state begins at time $k + 1$ since the state at time $k$ can no longer be
affected by the control input. Also, by requiring $\mathbf{R}$ to be positive definite, inputs are guaranteed to be finite.
Further restrictions on control inputs or states are commonly imposed using constraints when solving for the
optimal control input (Diehl et al., 2009).

Since  both  our  process  model  and  disturbance  model  are  nonlinear,  the  minimum  of  J(u)  must  be  found 
iteratively  using  a  nonlinear  optimization  technique.  In  this  paper,  we  use  unconstrained  Gauss-Newton 
minimization (Nocedal and Wright, 1999) to solve the nonlinear least-squares problem. We begin by linearizing
around an initial guess for the optimal control input sequence, $\bar{\mathbf{u}}$, with $\mathbf{u} = \bar{\mathbf{u}} + \delta\mathbf{u}$. A good initial
guess for $\bar{\mathbf{u}}$ is the sequence of optimal inputs calculated in the previous time-step. For the first time-step, we
use $\bar{\mathbf{u}} = \mathbf{0}$. With $\bar{\mathbf{z}}$ representing a sequence of states obtained from (2) when applying $\bar{\mathbf{u}}$ and with $\mathbf{z} = \bar{\mathbf{z}} + \delta\mathbf{z}$,
and $\bar{\mathbf{z}}_k = \mathbf{z}_k$, we find


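$$\bar{\mathbf{z}}_{k+b+1} = \mathbf{f}(\bar{\mathbf{z}}_{k+b}, \bar{\mathbf{u}}_{k+b}) + \mathbf{g}(\bar{\mathbf{z}}_{k+b}, \bar{\mathbf{u}}_{k+b}), \tag{4}$$

and

$$\delta\mathbf{z}_{k+b+1} \approx \mathbf{H}_{z,k+b}\, \delta\mathbf{z}_{k+b} + \mathbf{H}_{u,k+b}\, \delta\mathbf{u}_{k+b}, \tag{5}$$

where

$$\mathbf{H}_{z,k+b} = \left.\frac{\partial \mathbf{f}(\cdot)}{\partial \mathbf{z}}\right|_{\bar{\mathbf{z}}_{k+b}, \bar{\mathbf{u}}_{k+b}} + \left.\frac{\partial \mathbf{g}(\cdot)}{\partial \mathbf{z}}\right|_{\bar{\mathbf{z}}_{k+b}, \bar{\mathbf{u}}_{k+b}}, \qquad \mathbf{H}_{u,k+b} = \left.\frac{\partial \mathbf{f}(\cdot)}{\partial \mathbf{u}}\right|_{\bar{\mathbf{z}}_{k+b}, \bar{\mathbf{u}}_{k+b}} + \left.\frac{\partial \mathbf{g}(\cdot)}{\partial \mathbf{u}}\right|_{\bar{\mathbf{z}}_{k+b}, \bar{\mathbf{u}}_{k+b}}, \tag{6}$$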


for $b \in \{0, \dots, K-1\}$. In the case of $\mathbf{f}(\cdot)$, we have an analytical model and in the case of $\mathbf{g}(\cdot)$, the derivatives
are tractable so long as a continuously differentiable kernel function is chosen for use in the Gaussian process
model (see Section 3.2). As $\mathbf{z}_k$ is the current state as measured, $\delta\mathbf{z}_k = \mathbf{0}$. Given (5) and (6), we have


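\begin{align}
\delta\mathbf{z} &= \mathbf{H}_z\, \delta\mathbf{z} + \mathbf{H}_u\, \delta\mathbf{u} \tag{7} \\
&= (\mathbf{1} - \mathbf{H}_z)^{-1} \mathbf{H}_u\, \delta\mathbf{u} \tag{8} \\
&= \mathbf{H}\, \delta\mathbf{u}, \tag{9}
\end{align}

where $\mathbf{1}$ represents an identity matrix,

$$\mathbf{H}_z = \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{A}_z & \mathbf{0} \end{bmatrix}, \quad \mathbf{H}_u = \mathrm{diag}(\mathbf{H}_{u,k}, \dots, \mathbf{H}_{u,k+K-1}), \quad \text{and} \quad \mathbf{A}_z = \mathrm{diag}(\mathbf{H}_{z,k+1}, \dots, \mathbf{H}_{z,k+K-1}). \tag{10}$$

Substituting $\mathbf{z} = \bar{\mathbf{z}} + \delta\mathbf{z}$, (9), and $\mathbf{u} = \bar{\mathbf{u}} + \delta\mathbf{u}$ into (3) results in $J(\cdot)$ being quadratic in $\delta\mathbf{u}$,

\begin{align}
J(\mathbf{u}) &= (\mathbf{z}_d - \bar{\mathbf{z}} - \delta\mathbf{z})^T \mathbf{Q} (\mathbf{z}_d - \bar{\mathbf{z}} - \delta\mathbf{z}) + (\bar{\mathbf{u}} + \delta\mathbf{u})^T \mathbf{R} (\bar{\mathbf{u}} + \delta\mathbf{u}) \tag{11} \\
&\approx (\mathbf{z}_d - \bar{\mathbf{z}} - \mathbf{H}\, \delta\mathbf{u})^T \mathbf{Q} (\mathbf{z}_d - \bar{\mathbf{z}} - \mathbf{H}\, \delta\mathbf{u}) + (\bar{\mathbf{u}} + \delta\mathbf{u})^T \mathbf{R} (\bar{\mathbf{u}} + \delta\mathbf{u}). \tag{12}
\end{align}

We can find the value of $\delta\mathbf{u}$ that minimizes $J(\cdot)$ by solving

$$\frac{\partial J(\mathbf{u})}{\partial\, \delta\mathbf{u}} = \mathbf{0} \tag{13}$$

for $\delta\mathbf{u}$, and compute the control input about which (3) is linearized in the next iteration,

$$\bar{\mathbf{u}} \leftarrow \bar{\mathbf{u}} + \delta\mathbf{u}. \tag{14}$$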


After  iterating  to  convergence,  we  apply  the  first  element  of  the  resulting  optimal  control  input  sequence  for 
one  time-step,  and  start  all  over  at  the  next  time-step. 
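
For reference, the stationarity condition (13) can be solved in closed form. The following is a worked step rather than a statement of the authors' implementation; it assumes that $\mathbf{Q}$ and $\mathbf{R}$ are symmetric and writes $\mathbf{H} = (\mathbf{1} - \mathbf{H}_z)^{-1}\mathbf{H}_u$ for the stacked sensitivity matrix of (9).

```latex
% Differentiate the quadratic approximation (12) with respect to \delta u:
\frac{\partial J}{\partial\, \delta\mathbf{u}}
  = -2\,\mathbf{H}^T \mathbf{Q}\,(\mathbf{z}_d - \bar{\mathbf{z}} - \mathbf{H}\,\delta\mathbf{u})
    + 2\,\mathbf{R}\,(\bar{\mathbf{u}} + \delta\mathbf{u}) = \mathbf{0}
% Collect the terms in \delta u:
\quad\Longrightarrow\quad
(\mathbf{H}^T \mathbf{Q}\,\mathbf{H} + \mathbf{R})\,\delta\mathbf{u}
  = \mathbf{H}^T \mathbf{Q}\,(\mathbf{z}_d - \bar{\mathbf{z}}) - \mathbf{R}\,\bar{\mathbf{u}}
% R positive definite and Q positive semi-definite make the left-hand matrix invertible:
\quad\Longrightarrow\quad
\delta\mathbf{u}
  = (\mathbf{H}^T \mathbf{Q}\,\mathbf{H} + \mathbf{R})^{-1}
    \left[\mathbf{H}^T \mathbf{Q}\,(\mathbf{z}_d - \bar{\mathbf{z}}) - \mathbf{R}\,\bar{\mathbf{u}}\right]
```

Because $\mathbf{R}$ is positive definite, the matrix $\mathbf{H}^T \mathbf{Q} \mathbf{H} + \mathbf{R}$ is always invertible, so each Gauss-Newton step is well defined even when the tracking term alone would be rank-deficient.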


3.1.2  Partial-State  Feedback  Control 


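Consider the following system, covering most robotic systems, where the dynamics cascade into the kinematics:

$$\text{kinematics:} \quad \mathbf{x}_{k+1} = \mathbf{f}_{x,\text{true}}(\mathbf{x}_k, \mathbf{v}_k), \tag{15}$$

$$\text{dynamics:} \quad \mathbf{v}_{k+1} = \mathbf{f}_{v,\text{true}}(\mathbf{v}_k, \mathbf{u}_k), \tag{16}$$

with system state, $\mathbf{z}_k = (\mathbf{x}_k, \mathbf{v}_k)$, representing pose, $\mathbf{x}_k \in \mathbb{R}^{n_x}$, and velocity, $\mathbf{v}_k \in \mathbb{R}^{n_v}$, separately, and
control input, $\mathbf{u}_k$, all at time $k$. By substituting $\mathbf{v}_k = \mathbf{f}_{v,\text{true}}(\mathbf{v}_{k-1}, \mathbf{u}_{k-1})$ into (15), we can write

$$\mathbf{x}_{k+1} = \mathbf{f}_{\text{true}}(\mathbf{x}_k, \mathbf{v}_{k-1}, \mathbf{u}_{k-1}). \tag{17}$$

Now, if we assume that our a priori model represents the robot kinematics with $\mathbf{v}_k = \mathbf{u}_k$ (i.e., the a priori
model assumes robot dynamics are negligible), and that the true process, $\mathbf{f}_{\text{true}}(\cdot)$, can be represented by
the sum of our a priori and learned models, we find

$$\mathbf{x}_{k+1} = \underbrace{\mathbf{f}(\mathbf{x}_k, \mathbf{u}_k)}_{\text{a priori model}} + \underbrace{\mathbf{g}(\underbrace{\mathbf{x}_k, \mathbf{v}_{k-1}, \mathbf{u}_k, \mathbf{u}_{k-1}}_{\mathbf{a}_k})}_{\text{learned model}}, \tag{18}$$

with disturbance query state, $\mathbf{a}_k \in \mathbb{R}^p$,

$$\mathbf{a}_k = (\mathbf{x}_k, \mathbf{v}_{k-1}, \mathbf{u}_k, \mathbf{u}_{k-1}). \tag{19}$$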

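In other words, in order to capture the dynamics of the system, the disturbance query state, $\mathbf{a}_k$, is now
required to include historic states. We can further define the corresponding cost function to be

$$J(\mathbf{u}) = (\mathbf{x}_d - \mathbf{x})^T \mathbf{Q}_x (\mathbf{x}_d - \mathbf{x}) + \mathbf{u}^T \mathbf{R} \mathbf{u}, \tag{20}$$

where $\mathbf{Q}_x \in \mathbb{R}^{K n_x \times K n_x}$ is positive semi-definite, $\mathbf{R}$ and $\mathbf{u}$ are as in (3), $\mathbf{x}_d$ is a sequence of desired states,
$\mathbf{x}_d = (\mathbf{x}_{d,k+1}, \dots, \mathbf{x}_{d,k+K})$, $\mathbf{x}$ is a sequence of predicted states, $\mathbf{x} = (\mathbf{x}_{k+1}, \dots, \mathbf{x}_{k+K})$, and $K$ is the given
prediction horizon length. The state, $\mathbf{x}_k$, and learned model, $\mathbf{g}(\cdot)$, are now of reduced dimension, $n_x < n$,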
while  still  capturing  both  unknown  disturbances  and  unmodelled  dynamics.  This  approach  enables  a  user 
to  provide  a  simple  a  priori  model  with  few  parameters,  if  any.  Further,  the  derivation  suggests  that  the 
approach  is  applicable  to  processes  with  even  higher-order  dynamics  by  continuing  to  add  historic  states  to 
the  disturbance  dependency. 


3.2  Gaussian  Process  Disturbance  Model 


We model the disturbance, $\mathbf{g}(\cdot)$, as a GP, which is a function of a disturbance dependency, $\mathbf{a}$. The model
depends on observations of the disturbances collected during previous trials, representing attempts at achieving
a control objective, such as tracking a path from start to finish. At time $k$, we use the estimated poses,
$\mathbf{x}_k$ and $\mathbf{x}_{k-1}$, from the VT&R system, and the control input, $\mathbf{u}_{k-1}$, to isolate (18) for $\mathbf{g}(\mathbf{a}_{k-1})$,

$$\mathbf{g}(\mathbf{a}_{k-1}) = \mathbf{x}_k - \mathbf{f}(\mathbf{x}_{k-1}, \mathbf{u}_{k-1}). \tag{21}$$

We collect observations for all sample times in a trial and organize the data from trial $j$ into a set of data
pairs, $\mathcal{D}^{(j)} = [\{\mathbf{a}_0, \mathbf{g}(\mathbf{a}_0)\}, \dots, \{\mathbf{a}_k, \mathbf{g}(\mathbf{a}_k)\}, \dots, \{\mathbf{a}_{N^{(j)}-1}, \mathbf{g}(\mathbf{a}_{N^{(j)}-1})\}]$, where $N^{(j)}$ is the number of time-steps it
took to travel the length of the path during trial $j$, and $\mathbf{a}_k$ is as defined in (19). After $j$ trials we have
multiple datasets, $\mathcal{D}^{(1)}, \dots, \mathcal{D}^{(j)}$, that we combine into a single database, $\mathcal{D}$, with $N = N^{(1)} + \dots + N^{(j)}$
observations. We also drop the time-step index, $k$, on each data pair in $\mathcal{D}$, so that when referring to $\mathbf{a}_{\mathcal{D},i}$ or
$\mathbf{g}_{\mathcal{D},i}$, we mean the $i$th pair of data in the superset $\mathcal{D}$. Note that there is no requirement that $N^{(j)} = N^{(j-1)}$ as
the system simply collects observations as they occur for the length of time that it takes to complete a trial.
Moreover,  all  experiences  are  treated  equally  as  observations  of  the  underlying  unmodelled  disturbance.  In 
fact,  the  system  collects  experience  data  whenever  it  moves  while  repeating  the  desired  path.  As  a  result, 
the  system  does  not  require  identical  initial  conditions,  termination  conditions,  or  speed  schedules. 
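
The bookkeeping behind (19) and (21) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `collect_observations`, the tuple-based vector type, and the way logged poses, velocities, and inputs are passed in are all hypothetical, and the nominal model is supplied by the caller.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[float, ...]


def collect_observations(
    poses: Sequence[Vector],
    velocities: Sequence[Vector],
    inputs: Sequence[Vector],
    nominal_model: Callable[[Vector, Vector], Vector],
) -> List[Tuple[Vector, Vector]]:
    """Return (a_{k-1}, g(a_{k-1})) pairs for one trial.

    poses[k], velocities[k] and inputs[k] hold x_k, v_k and u_k
    (hypothetical logging layout, one entry per time-step).
    """
    data = []
    # a_{k-1} needs v_{k-2} and u_{k-2}, so the first usable k is 2.
    for k in range(2, len(poses)):
        x_prev, u_prev = poses[k - 1], inputs[k - 1]
        predicted = nominal_model(x_prev, u_prev)
        # Equation (21): measured pose minus the nominal prediction.
        disturbance = tuple(xm - xp for xm, xp in zip(poses[k], predicted))
        # Equation (19) evaluated at time k-1 (tuple concatenation).
        query = x_prev + velocities[k - 2] + u_prev + inputs[k - 2]
        data.append((query, disturbance))
    return data


# Observations from every trial are appended to one database, so trials
# of different length or speed schedule can be mixed freely.
database: List[Tuple[Vector, Vector]] = []
trial = collect_observations(
    poses=[(0.0,), (1.0,), (2.5,), (3.5,)],
    velocities=[(1.0,), (1.5,), (1.0,)],
    inputs=[(1.0,), (1.0,), (1.0,)],
    nominal_model=lambda x, u: (x[0] + u[0],),  # toy 1-D integrator
)
database.extend(trial)
```

In the toy trial above the nominal integrator under-predicts the second step by 0.5, and that residual is what gets stored as the disturbance observation.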

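In this work, we train a separate GP for each dimension in $\mathbf{g}(\cdot) \in \mathbb{R}^n$ to model disturbances as the robot
travels along a path. This approach makes the assumption that disturbances are uncorrelated. For simplicity
of discussion, we will assume for now that $n = 1$ and denote $\mathbf{g}_{\mathcal{D},i}$ by $g_{\mathcal{D},i}$. The learned model assumes a
measured disturbance originates from a Gaussian process model,

$$g(\mathbf{a}_{\mathcal{D},i}) \sim \mathcal{GP}\left(0, k(\mathbf{a}_{\mathcal{D},i}, \mathbf{a}_{\mathcal{D},i})\right), \tag{22}$$

with zero mean and kernel function, $k(\mathbf{a}_{\mathcal{D},i}, \mathbf{a}_{\mathcal{D},i})$, to be defined. We assume that each disturbance measurement
is corrupted by zero-mean additive noise with variance, $\sigma_n^2$, so that $\hat{g}_{\mathcal{D},i} = g_{\mathcal{D},i} + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$. Then
a modelled disturbance, $g(\mathbf{a}_k)$, and the $N$ observed disturbances, $\mathbf{g} = (\hat{g}_{\mathcal{D},1}, \dots, \hat{g}_{\mathcal{D},N})$, are jointly Gaussian,

$$\begin{bmatrix} \mathbf{g} \\ g(\mathbf{a}_k) \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} \mathbf{K} & \mathbf{k}(\mathbf{a}_k)^T \\ \mathbf{k}(\mathbf{a}_k) & k(\mathbf{a}_k, \mathbf{a}_k) \end{bmatrix}\right), \tag{23}$$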


where $\mathbf{K} \in \mathbb{R}^{N \times N}$ with $(\mathbf{K})_{ij} = k(\mathbf{a}_{\mathcal{D},i}, \mathbf{a}_{\mathcal{D},j})$, and $\mathbf{k}(\mathbf{a}_k) = [k(\mathbf{a}_k, \mathbf{a}_{\mathcal{D},1}), k(\mathbf{a}_k, \mathbf{a}_{\mathcal{D},2}), \dots, k(\mathbf{a}_k, \mathbf{a}_{\mathcal{D},N})]$. In
our case, we use the Squared-Exponential (SE) kernel function (Rasmussen, 2006),

$$k(\mathbf{a}_i, \mathbf{a}_j) = \sigma_f^2 \exp\left(-\tfrac{1}{2} (\mathbf{a}_i - \mathbf{a}_j)^T \mathbf{M}^{-2} (\mathbf{a}_i - \mathbf{a}_j)\right) + \sigma_n^2 \delta_{ij}, \tag{24}$$

where $\delta_{ij}$ is the Kronecker delta, that is 1 if and only if $i = j$ and 0 otherwise, and the constants $\mathbf{M}$, $\sigma_f$, and $\sigma_n$
are hyperparameters. The SE kernel function is an example of a Radial Basis Function (Rasmussen, 2006)
and is commonly used to represent continuous functions based on dense data. Further, the SE kernel is
continuously and analytically differentiable, enabling the rapid computation of derivatives for the Gauss-Newton
optimization algorithm. In our implementation with $\mathbf{a}_k \in \mathbb{R}^p$, the constant $\mathbf{M}$ is a diagonal matrix,
$\mathbf{M} = \mathrm{diag}(\mathbf{m})$, $\mathbf{m} \in \mathbb{R}^p$, representing the relevance of each component in $\mathbf{a}_k$, while the constants, $\sigma_f^2$ and
$\sigma_n^2$, represent the process variation and measurement noise, respectively. Finally, we have that the prediction,
$g(\mathbf{a}_k)$, of the disturbance at an arbitrary state, $\mathbf{a}_k$, is also Gaussian distributed,

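$$g(\mathbf{a}_k) \mid \mathbf{g} \sim \mathcal{N}\left(\mathbf{k}(\mathbf{a}_k) \mathbf{K}^{-1} \mathbf{g},\; k(\mathbf{a}_k, \mathbf{a}_k) - \mathbf{k}(\mathbf{a}_k) \mathbf{K}^{-1} \mathbf{k}(\mathbf{a}_k)^T\right). \tag{25}$$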


In  this  work,  we  only  make  use  of  the  predicted  mean  value  of  disturbances.  However,  in  future  work, 
the  predicted  variance  could  be  used  as  an  indication  of  the  uncertainty  in  the  learned  model  and  used 
appropriately  in  deciding  the  resulting  control  command.  Finally,  we  include  further  detail  on  the  storage 
and  retrieval  of  observations  for  online  operation  in  Section  4.3. 
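
The prediction (25) with the SE kernel (24) can be sketched in dependency-free Python as follows. This is an illustrative sketch for a single output dimension, not the authors' code: the names `se_kernel`, `cholesky_solve`, and `gp_predict` are hypothetical, the linear solves use a plain Cholesky factorization rather than any particular library, and the prior variance at the query point is taken as $\sigma_f^2$ (the noise-free latent value), which is an assumption about how $k(\mathbf{a}_k, \mathbf{a}_k)$ is evaluated.

```python
import math
from typing import List, Sequence, Tuple


def se_kernel(a_i, a_j, m, sigma_f, sigma_n, same_index):
    """Squared-exponential kernel of equation (24) with M = diag(m)."""
    sq = sum(((p - q) / scale) ** 2 for p, q, scale in zip(a_i, a_j, m))
    noise = sigma_n ** 2 if same_index else 0.0  # Kronecker-delta term
    return sigma_f ** 2 * math.exp(-0.5 * sq) + noise


def cholesky_solve(K: List[List[float]], b: Sequence[float]) -> List[float]:
    """Solve K w = b for symmetric positive-definite K."""
    n = len(b)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][t] * L[j][t] for t in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = b
        y[i] = (b[i] - sum(L[i][t] * y[t] for t in range(i))) / L[i][i]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution: L^T w = y
        w[i] = (y[i] - sum(L[t][i] * w[t] for t in range(i + 1, n))) / L[i][i]
    return w


def gp_predict(a_query, inputs, targets, m, sigma_f, sigma_n) -> Tuple[float, float]:
    """Mean and variance of equation (25) at a_query."""
    n = len(inputs)
    K = [[se_kernel(inputs[i], inputs[j], m, sigma_f, sigma_n, i == j)
          for j in range(n)] for i in range(n)]
    k_vec = [se_kernel(a_query, a, m, sigma_f, sigma_n, False) for a in inputs]
    weights = cholesky_solve(K, targets)   # K^{-1} g
    mean = sum((kv * w for kv, w in zip(k_vec, weights)), 0.0)
    v = cholesky_solve(K, k_vec)           # K^{-1} k(a_k)^T
    # Assumption: prior variance at the query is sigma_f^2 (no noise term).
    var = sigma_f ** 2 - sum((kv * vi for kv, vi in zip(k_vec, v)), 0.0)
    return mean, var


# With an empty database the prediction falls back to the zero prior mean,
# matching the behaviour described for the first trial.
mean0, var0 = gp_predict((0.0,), [], [], (1.0,), 1.0, 0.1)
```

Only the mean is used by the controller described here; the variance is returned to show where an uncertainty estimate would come from.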


3.3  Gaussian  Process  Hyperparameter  Selection 


Having defined the NMPC algorithm and disturbance model, $g(\mathbf{a}_k)$, it remains to define the source of the
hyperparameters, $\mathbf{M}$, $\sigma_f^2$, and $\sigma_n^2$. Solving for optimal hyperparameters is not currently a real-time process
in our experiments. As such, we assume that a suitable set of hyperparameters has been determined prior to
each trial based on previous experience (i.e., from previous trials). For the first trial, when the robot has no
experience, the predicted disturbance is zero. Given a set of experiences, we find the optimal hyperparameters
offline by maximizing the log marginal likelihood of collected experiences using a gradient ascent algorithm
(Rasmussen, 2006). In order to avoid local maxima, the algorithm is repeated several times, initialized with
different initial values, and the set of hyperparameters resulting in the greatest likelihood is selected.
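
For a zero-mean GP with Gaussian noise, the log marginal likelihood being maximized has a standard closed form, reproduced here for convenience. The symbol $\boldsymbol{\theta} = (\mathbf{m}, \sigma_f, \sigma_n)$ for the stacked hyperparameters is introduced only for this note, and $\mathbf{K}$ is assumed to be the noise-inclusive kernel matrix built from (24).

```latex
\log p(\mathbf{g} \mid \mathbf{a}_{\mathcal{D},1}, \dots, \mathbf{a}_{\mathcal{D},N}, \boldsymbol{\theta})
  = -\tfrac{1}{2}\, \mathbf{g}^T \mathbf{K}^{-1} \mathbf{g}   % data-fit term
    - \tfrac{1}{2}\, \log \lvert \mathbf{K} \rvert            % complexity penalty
    - \tfrac{N}{2}\, \log 2\pi                                % normalization constant
```

The first term rewards hyperparameters that explain the observed disturbances, while the second penalizes overly flexible kernels; this objective is generally non-convex in $\boldsymbol{\theta}$, which is why restarts from several initial values are useful.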


3.4  Illustrative  Example 


In  this  section,  we  highlight  the  benefits  of  LB-NMPC  and  present  an  illustrative  example  comparing:  (i) 
fixed  feedback  control,  (ii)  non-learning  NMPC,  and  (iii)  LB-NMPC.  Consider  the  following  process  model: 

$$z_{k+1} = \alpha z_k + \Delta t\, \beta u_k + \Delta t\, d_k, \tag{26}$$

with system state, $z_k \in \mathbb{R}$, control input, $u_k \in \mathbb{R}$, and time-dependent disturbance, $d_k \in \mathbb{R}$, shown in
Figure 4. Further, $\alpha, \beta \in \mathbb{R}$ are unknown constants. In simulation, they are 0.99 and 0.5, respectively. The
goal is to track a sequence of desired states, $z_{d,k}$, as shown in green in Figure 4, which is known to the
example controllers prior to starting. The feedback controller uses a simple feedback law,

$$u_{\text{fb},k} = k_{\text{fb}} (z_{d,k} - z_k). \tag{27}$$

Figure 4: Compared to both simple feedback control (red) and MPC (black), LB-NMPC (blue) is able to
anticipate and reduce errors caused by changes in the desired state and unmodelled disturbances.


Both  the  NMPC  and  LB-NMPC  controllers  assume  a  nominal  process  model, 


$$z_{k+1} = z_k + \Delta t\,u_k, \qquad (28)$$

a prediction horizon, $K = 10$, and a cost function (3), with $\mathbf{Q} = 10 \times \mathbf{1}$ and $\mathbf{R} = 0.01 \times \mathbf{1}$, where $\mathbf{1}$ is
the identity matrix. The LB-NMPC algorithm also includes a learned disturbance model, as described in
Section 3.1, such that the complete system model used by the LB-NMPC algorithm is

$$z_{k+1} = z_k + \Delta t\,u_k + g(z_k, u_k, k). \qquad (29)$$

In this simple example, the disturbances are a function of time (26) and hence the learned disturbance is a
function of time, $k$. However, in practice, we assume disturbances are time-invariant (18).

As expected, the feedback controller is incapable of anticipating errors caused by either changes in desired
state, $z_{d,k}$, or disturbances, $d_k$ (Figure 4). On the other hand, the MPC controller (without a learned model)
enables  some  amount  of  predictive  control  to  reduce  tracking  errors  due  to  changes  in  desired  state.  However, 
tracking  errors  are  not  cancelled  completely  because  the  MPC  algorithm  does  not  have  the  correct  process 
model.  Finally,  the  LB-NMPC  algorithm  exploits  its  previous  experience  to  predict  and  compensate  for  both 
changes  in  the  desired  state  and  unknown  disturbances  not  anticipated  by  the  a  priori  process  model. 
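
The comparison above can be reproduced in spirit with a short simulation. The following is only a minimal sketch under stated assumptions, not the controllers used for Figure 4: the time-step, trial length, reference signal, disturbance signal, and feedback gain are all hypothetical; a one-step inverse-model controller stands in for NMPC with $K = 10$ and cost (3); and a time-indexed lookup table of observed model discrepancies stands in for the GP in (29).

```python
import math

ALPHA, BETA = 0.99, 0.5   # true constants from the text, hidden from the controllers
DT, N = 0.1, 200          # assumed time-step and trial length


def desired(k):
    # Assumed reference signal: a step followed by a slow sinusoid.
    return 0.0 if k < 50 else 1.0 + 0.5 * math.sin(0.05 * (k - 50))


def disturbance(k):
    # Assumed time-dependent disturbance d_k.
    return 0.5 * math.sin(0.1 * k)


def true_step(z, u, k):
    # True process model (26).
    return ALPHA * z + DT * BETA * u + DT * disturbance(k)


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


def run_feedback(k_fb=5.0):
    # Feedback law (27) with an assumed gain k_fb.
    z, errors = 0.0, []
    for k in range(N):
        z = true_step(z, k_fb * (desired(k) - z), k)
        errors.append(z - desired(k + 1))
    return rms(errors)


def run_model_based(g_hat):
    # One-step inverse of z_{k+1} = z_k + DT*u_k + g_hat[k].
    # With g_hat all zeros this is the nominal model (28).
    z, errors, g_obs = 0.0, [], []
    for k in range(N):
        u = (desired(k + 1) - z - g_hat[k]) / DT
        z_next = true_step(z, u, k)
        g_obs.append(z_next - (z + DT * u))  # observed model discrepancy at step k
        errors.append(z_next - desired(k + 1))
        z = z_next
    return rms(errors), g_obs


rms_feedback = run_feedback()
g_hat = [0.0] * N  # trial 1: no experience, so the predicted disturbance is zero
rms_per_trial = []
for trial in range(5):
    trial_rms, g_hat = run_model_based(g_hat)  # reuse this trial's discrepancies next time
    rms_per_trial.append(trial_rms)

print("feedback RMS error:", round(rms_feedback, 4))
print("model-based RMS error per trial:", [round(r, 4) for r in rms_per_trial])
```

The first entry of `rms_per_trial` corresponds to the non-learning controller, and later entries show the effect of reusing experience from the previous trial. Because the table is indexed by time only, it is suited to this example, where the disturbance in (26) is time-dependent.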


4  Implementation 

4.1  Robot  Model 

In this paper, robots are modelled (Figure 5) as unicycle-type vehicles with ‘position’ state variables (18),
$\mathbf{x}_k = (x_k, y_k, \theta_k)$. At every time-step, the VT&R localization algorithm provides the position of the robot,
$\mathbf{x}_k$, relative to the nearest desired pose by Euclidean distance (Figure 6). The robots have two control
inputs, their linear and angular velocities, $\mathbf{u}_k = (v_{\mathrm{cmd},k}, \omega_{\mathrm{cmd},k})$. The commanded linear velocity, $v_{\mathrm{cmd},k}$,
is constrained to the scheduled speed for the $j$th trial at the nearest path vertex, $v_{\mathrm{cmd},k} = v^{(j)}_{\mathrm{sched},i}$, leaving
only the angular velocity, $\omega_{\mathrm{cmd},k}$, for the NMPC algorithm to optimize considering (20). Prior to each trial,
scheduled path speeds are optimized by the experience-based speed scheduler (Section 4.2).


When  the  time  between  control  signal  updates  is  defined  as  At,  the  resulting  nominal  process  model  employed 
by  the  NMPC  algorithm  is 


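$$\mathbf{f}(\mathbf{x}_k, \mathbf{u}_k) = \mathbf{x}_k + \begin{bmatrix} \Delta t \cos\theta_k & 0 \\ \Delta t \sin\theta_k & 0 \\ 0 & \Delta t \end{bmatrix} \mathbf{u}_k, \qquad (30)$$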


which  represents  a  simple  kinematic  model  for  our  robot;  it  does  not  account  for  dynamics  or  environmental 
disturbances.  We  use  the  same  a  priori  model  for  all  robots  in  our  experiments,  despite  them  being  quite 
different  in  scale  (Figure  1). 
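
As a concrete reading of the nominal model, the sketch below propagates a pose through the unicycle update. It is illustrative only: the function names, the 0.1 s time-step, and the sample commands are assumptions rather than values taken from the experiments.

```python
import math


def nominal_step(x, u, dt):
    """Nominal unicycle process model.

    x = (x, y, theta) is the pose; u = (v_cmd, omega_cmd) are the commanded
    linear and angular velocities; dt is the control period.
    """
    px, py, theta = x
    v_cmd, omega_cmd = u
    return (px + dt * math.cos(theta) * v_cmd,
            py + dt * math.sin(theta) * v_cmd,
            theta + dt * omega_cmd)


def rollout(x0, inputs, dt):
    # Apply the nominal model repeatedly over a sequence of commands.
    poses = [x0]
    for u in inputs:
        poses.append(nominal_step(poses[-1], u, dt))
    return poses


# Ten steps at a constant (hypothetical) command.
poses = rollout((0.0, 0.0, 0.0), [(0.4, 0.2)] * 10, dt=0.1)
```

Note that nothing in this update depends on mass, wheelbase, or terrain, which is why the same function can serve as the a priori model for all three robots.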


The ‘velocity’ state variables are $\mathbf{v}_k = (v_{\mathrm{act},k}, \omega_{\mathrm{act},k})$, which represent the actual linear and rotational speeds
of the robot. These will differ from the commanded ones, $\mathbf{u}_k$, owing to the fact that the robots we are working
with  have  underlying  control  loops  that  attempt  to  drive  the  robot  at  the  commanded  velocities.  However, 
the  combined  dynamics  of  the  robot  and  these  rate  controllers  are  not  modelled.  We  allow  the  LB-NMPC 
algorithm  to  learn  these  dynamics,  as  well  as  any  other  systematic  disturbances,  based  on  experience. 


In order to build and query the learned model, $\mathbf{g}(\cdot)$, throughout the prediction horizon, we require all of
the quantities in (19): $\mathbf{a}_{k+b} = (\mathbf{x}_{k+b}, \mathbf{v}_{k-1+b}, \mathbf{u}_{k+b}, \mathbf{u}_{k-1+b})$, $b \in B = \{0, \ldots, K-1\}$. We know $\mathbf{u}_{k+b}$ and
$\mathbf{u}_{k-1+b}$, $b \in B$, as these are commanded inputs. We initially obtain the robot position from our vision-based
localization system, $\mathbf{x}_k = \hat{\mathbf{x}}_k$, then from our system model (18), $\mathbf{x}_{k+1+b} = \mathbf{f}(\mathbf{x}_{k+b}, \mathbf{u}_{k+b}) + \mathbf{g}(\mathbf{a}_{k+b})$,
$b \in \{1, \ldots, K-1\}$. Finally, we compute the velocity state variables, $\mathbf{v}_{k-1+b} = (v_{\mathrm{act},k-1+b}, \omega_{\mathrm{act},k-1+b})$, based


Figure 5: Definition of the robot velocities, $v_k$ and $\omega_k$, and three pose variables, $x_k$, $y_k$, and $\theta_k$. At each
time-step, the VT&R algorithm provides an estimate of the robot position relative to the nearest desired
pose by Euclidean distance.


Figure  6:  In  practice,  our  overall  system  combines  the  LB-NMPC  algorithm  (Section  3),  the  experience-based 
speed  scheduler  (Section  4.2),  and  a  vision-based  VT&R  system  for  localization. 


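on the computed robot positions,

$$v_{\mathrm{act},k-1+b} = \frac{\sqrt{(x_{k+b} - x_{k-1+b})^2 + (y_{k+b} - y_{k-1+b})^2}}{\Delta t}, \quad b \in B,$$

and

$$\omega_{\mathrm{act},k-1+b} = \frac{\theta_{k+b} - \theta_{k-1+b}}{\Delta t}, \quad b \in B,$$

with $\mathbf{x}_{k-1} = \hat{\mathbf{x}}_{k-1}$. Since $\hat{\mathbf{x}}_k$ and $\hat{\mathbf{x}}_{k-1}$ come from our vision-based localization system, we are able to initialize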
the  predictive  controller  with  accurate  velocity  estimates  with  respect  to  the  ground.  This  is  preferable  to 
using  wheel  encoders  because  they  are  unable  to  measure  wheel  slip  and  other  ground-interaction  effects. 
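
The velocity state variables are therefore finite differences of consecutive poses. A minimal sketch of that computation follows; the function name and the sample poses are hypothetical, and the heading difference is taken directly, as in the expression above, without any angle wrapping.

```python
import math


def actual_velocities(poses, dt):
    """Finite-difference estimates (v_act, omega_act) between consecutive poses.

    poses is a list of (x, y, theta) tuples spaced dt seconds apart.
    """
    velocities = []
    for (x0, y0, th0), (x1, y1, th1) in zip(poses, poses[1:]):
        v_act = math.hypot(x1 - x0, y1 - y0) / dt  # distance travelled per unit time
        omega_act = (th1 - th0) / dt               # plain difference, no angle wrapping
        velocities.append((v_act, omega_act))
    return velocities


# Hypothetical poses, 0.1 s apart.
poses = [(0.0, 0.0, 0.0), (0.03, 0.04, 0.01), (0.06, 0.08, 0.03)]
vels = actual_velocities(poses, dt=0.1)
```

When the first two poses come from localization rather than from the model, the resulting estimate reflects motion over the ground, including any wheel slip.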

4.2  Automated  Speed  Scheduler 

We implemented an automated speed scheduler (Ostafew et al., 2014a) to demonstrate the LB-NMPC
algorithm’s  ability  to  interpolate  and  extrapolate  from  learned  experiences.  The  algorithm  uses  experience 
from  previous  trials  to  schedule  speeds  that  minimize  travel  time  while  ensuring  reliable  localization,  low 
path-tracking  errors,  and  realizable  control  inputs.  Effectively,  the  scheduler  incrementally  increases  or 
decreases  speeds  along  the  path  where  possible  or  necessary,  respectively,  requiring  the  LB-NMPC  algorithm 
to  interpolate  and  extrapolate  from  previous  experience. 


When  using  vision-based  localization  systems,  there  exists  a  speed  limit  above  which  localization  becomes 
unreliable  and  the  safety  of  the  robot  can  no  longer  be  assured.  This  speed  limit  may  come  as  a  result  of 
motion  blur,  a  degraded  scene  (relative  to  when  the  path  was  taught),  or  large  deviations  from  the  path.  As 
an indicator of the conditions faced by the localization system, we record the number of features matched by
the VT&R system, $c^{(j)}_{\mathrm{feature},i}$, when passing the $i$th vertex during the $j$th trial for use in the speed scheduler.


We also record the lateral and heading path-tracking errors, $e^{(j)}_{L,i}$ and $e^{(j)}_{H,i}$, respectively,
when passing the $i$th vertex during the $j$th trial. Since we have assumed that the desired path is safe and
free  of  obstacles,  it  is  important  to  maintain  low  path-tracking  errors.  Further,  the  vision  system  is  sensitive 
to  perspective  changes  between  the  teach  pass  and  any  repeat  pass.  Perspective  changes  are  the  direct  result 
of  path-tracking  errors  and  reduce  the  reliability  of  the  localization  system. 


Finally,  the  scheduled  linear  speed  also  addresses  constraints  on  angular  velocities  resulting  from  actuator 
limits.  The  true  robot  angular  velocity  differs  from  the  commanded  as  a  result  of  wheel  slip,  side  slopes,  and 
other model discrepancies. As a result, we record the commanded angular velocity, $\omega^{(j)}_{\mathrm{cmd},i}$, during the $j$th
trial when passing the $i$th path vertex.

Then if path-tracking errors are below (above) a threshold, control inputs are below (above) a threshold, and
(or) there are a sufficient (insufficient) number of matched features in a certain section, the speed in that
section is increased (decreased, respectively). Using tuned values for increasing and decreasing the scheduled
speed, $\gamma_1 > 0$, $\gamma_2 > 0$, respectively, and thresholds, $\lambda_L > 0$, $\lambda_H > 0$, $\lambda_\omega > 0$, and $\lambda_{\mathrm{feat}} > 3$, the scheduler
follows rules to generate the suggested speeds for each path vertex:

$$v^{(j+1)}_{\mathrm{sched},i} = \begin{cases}
v^{(j)}_{\mathrm{sched},i} + \gamma_1 & \text{if } (|e^{(j)}_{L,i}| < \lambda_L) \wedge (|e^{(j)}_{H,i}| < \lambda_H) \wedge (|\omega^{(j)}_{\mathrm{cmd},i}| < \lambda_\omega) \wedge (c^{(j)}_{\mathrm{feature},i} > \lambda_{\mathrm{feat}}) \\
v^{(j)}_{\mathrm{sched},i} - \gamma_2 & \text{if } (|e^{(j)}_{L,i}| > \lambda_L \lambda_{\mathrm{db}}) \vee (|e^{(j)}_{H,i}| > \lambda_H \lambda_{\mathrm{db}}) \vee (|\omega^{(j)}_{\mathrm{cmd},i}| > \lambda_\omega \lambda_{\mathrm{db}}) \vee (c^{(j)}_{\mathrm{feature},i} < \lambda_{\mathrm{feat}} / \lambda_{\mathrm{db}}) \\
v^{(j)}_{\mathrm{sched},i} & \text{otherwise.}
\end{cases} \qquad (31)$$

Effectively, the automated speed scheduler identifies sections of the path where the system can tolerate higher
speeds and sections where it cannot, thus balancing the trade-off between speed, path-tracking errors, and
vision-based localization reliability. We use $\lambda_{\mathrm{db}} > 1$ to produce a deadband where the speed at a vertex is
neither increased nor decreased. For the first trial, the scheduled speed at all vertices in the path was set to
a fixed speed, $v^{(1)}_{\mathrm{sched},i} = v_{\mathrm{init}}$.
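
The scheduler rules can be sketched as a single per-vertex update. This is an illustrative sketch only: the function name, the dictionary keys, and every numeric gain and threshold below are hypothetical placeholders, not the tuned values used in the experiments.

```python
def next_scheduled_speed(v_sched, e_lat, e_head, omega_cmd, n_features, p):
    """Return the scheduled speed for the next trial at one path vertex.

    v_sched, e_lat, e_head, omega_cmd and n_features are the values recorded
    at this vertex during the current trial; p holds gains and thresholds.
    """
    # Increase only when every indicator is within its threshold.
    all_within = (abs(e_lat) < p["lam_L"]
                  and abs(e_head) < p["lam_H"]
                  and abs(omega_cmd) < p["lam_w"]
                  and n_features > p["lam_feat"])
    # Decrease when any indicator falls outside the deadband.
    any_outside = (abs(e_lat) > p["lam_L"] * p["lam_db"]
                   or abs(e_head) > p["lam_H"] * p["lam_db"]
                   or abs(omega_cmd) > p["lam_w"] * p["lam_db"]
                   or n_features < p["lam_feat"] / p["lam_db"])
    if all_within:
        return v_sched + p["gamma_1"]
    if any_outside:
        return v_sched - p["gamma_2"]
    return v_sched  # inside the deadband: leave the speed unchanged


# Hypothetical parameters for illustration only.
PARAMS = {"gamma_1": 0.05, "gamma_2": 0.1, "lam_L": 0.1, "lam_H": 0.1,
          "lam_w": 1.0, "lam_feat": 20, "lam_db": 1.2}

faster = next_scheduled_speed(0.5, 0.02, 0.01, 0.3, 50, PARAMS)
slower = next_scheduled_speed(0.5, 0.20, 0.01, 0.3, 50, PARAMS)
same = next_scheduled_speed(0.5, 0.11, 0.01, 0.3, 50, PARAMS)
```

Because the deadband factor exceeds one, the increase and decrease conditions cannot both hold, so the order of the two checks does not change the result.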


4.3  Managing  Experiences 


In  order  to  ensure  the  LB-NMPC  algorithm  is  executed  in  constant  computation  time,  our  implementation 
requires  the  ability  to  use  a  subset  of  the  observed  experiences  when  computing  a  disturbance.  Similar  to 
work by Nguyen-Tuong et al. (2009) and Meier et al. (2014), we employ a local model. However, unlike
their work, we use a single sliding local model. As experiences are learned, they are stored in bins, by
path vertex, $i$, and commanded velocity, $l = \lfloor v_{\mathrm{cmd},k} / v_{\mathrm{bin}} \rfloor$, where $v_{\mathrm{bin}}$ represents the velocity discretization
and $\lfloor \cdot \rfloor$ represents the floor function. When the number of experiences in a bin exceeds a threshold, $c_{\mathrm{bin}}$,
the oldest experience in the bin is discarded. Then, when computing a control input at the $i$th vertex, a
‘local’ dataset is created, drawing experiences from bins at nearby path vertices and commanded velocities,
$\mathcal{D} = \{\mathcal{D}_{a,b} \mid a \in \{i - c_{\mathrm{vertex}}, \ldots, i + c_{\mathrm{vertex}}\},\ b \in \{l - c_{\mathrm{velocity}}, \ldots, l + c_{\mathrm{velocity}}\}\}$. Thus, models are effectively
assembled  on  demand  rather  than  precomputing  hundreds  of  local  models,  enabling  a  constant-time  algorithm 
independent  of  path  length  or  deployment  time. 
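
One way to realize this binning scheme is a dictionary of bounded queues keyed by (vertex, velocity index). The sketch below is a hypothetical illustration, not the implementation used in the experiments: the class and method names are invented, and the stored experiences are placeholder tuples. The parameter values in the usage lines follow Section 5.2.

```python
import math
from collections import defaultdict, deque


class ExperienceStore:
    def __init__(self, v_bin, c_bin):
        self.v_bin = v_bin
        # deque(maxlen=c_bin) discards the oldest entry automatically when full.
        self.bins = defaultdict(lambda: deque(maxlen=c_bin))

    def velocity_index(self, v_cmd):
        # Velocity bin index: floor of commanded speed over the discretization.
        return math.floor(v_cmd / self.v_bin)

    def add(self, vertex, v_cmd, experience):
        self.bins[(vertex, self.velocity_index(v_cmd))].append(experience)

    def local_dataset(self, vertex, v_cmd, c_vertex, c_velocity):
        # Gather experiences from bins near the current vertex and speed.
        l = self.velocity_index(v_cmd)
        data = []
        for a in range(vertex - c_vertex, vertex + c_vertex + 1):
            for b in range(l - c_velocity, l + c_velocity + 1):
                data.extend(self.bins.get((a, b), ()))
        return data


store = ExperienceStore(v_bin=0.25, c_bin=4)
for n in range(6):  # six observations in one bin; only the newest four survive
    store.add(vertex=10, v_cmd=0.4, experience=("obs", n))
store.add(vertex=12, v_cmd=0.6, experience=("obs", "nearby"))
store.add(vertex=40, v_cmd=0.4, experience=("obs", "far"))

local = store.local_dataset(vertex=10, v_cmd=0.4, c_vertex=5, c_velocity=1)
```

The cost of assembling `local` depends only on the window size and the bin capacity, not on how many vertices the path contains or how long the robot has been deployed.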


5  Field  Testing 

5.1  Overview 

We  tested  the  LB-NMPC  algorithm  in  three  different  experiments  involving  three  significantly  different 
mobile  robots  (Figure  1)  and  paths  with  dirt,  gravel,  sand,  grass,  inclines,  and  side  slopes.  This  resulted  in 
over  3  km  of  learning-enabled  path-tracking  in  GPS-denied  environments.  The  three  tests  demonstrate  the 
algorithm’s  effectiveness  at  reducing  path-tracking  errors  with  only  cursory  prior  knowledge  of  the  robot’s 
behaviour  (i.e.,  that  it  could  be  treated  as  a  unicycle  robot,  Section  4.1).  Details  on  the  tuning  parameters 
are  presented  in  Section  5.2. 

The  first  experiment  (Section  5.3)  demonstrated  the  algorithm’s  ability  to  learn  unmodelled  environmental 
disturbances.  We  tested  on  a  30-m-long  path  including  slopes,  dusty  ground,  and  loose  gravel  surfaces 
(Figure  7).  The  robot  was  a  50  kg,  four-wheeled  Clearpath  Husky  robot  traveling  at  a  desired  speed  of 
0.4 m/s (i.e., the automated speed scheduler was disabled for the first experiment). With a 0.5 m wheelbase,
Husky robots are relatively small and agile skid-steered mobile robots. As such, the path included slope
angles up to 15°, side-slope angles up to 15°, and path curvatures up to 1 m$^{-1}$.


The  second  experiment  (Section  5.4)  demonstrated  the  algorithm’s  ability  to  interpolate  and  extrapolate 
from  previous  experience.  We  used  a  150  kg,  six-wheeled  ROC6  robot  (Figure  1)  learning  to  drive  at  a  range 
of  scheduled  speeds  over  20  trials  on  a  60-m-long  path.  Like  the  Husky,  the  ROC6  robot  is  a  skid-steered 
platform.  However,  the  ROC6  is  heavier  and  longer,  with  a  1.5  m  wheelbase,  and  is  better  suited  to  operate 
in  more  open  terrains  at  higher  speeds.  Scheduled  speeds  for  each  trial  were  provided  by  the  proposed 
automated  scheduler  that  used  matched  features,  path-tracking  errors,  and  control  inputs  from  previous 
trials  to  determine  safe  speeds  for  the  next  trial  (Section  4.2). 

Finally,  the  third  experiment  (Section  5.5)  further  demonstrated  the  algorithm’s  ability  to  learn  disturbances 
due  to  robot  design.  Whereas  the  first  two  experiments  involved  skid-steered  robots,  this  experiment  used 
a  600  kg,  Ackermann-steered  DMRV  robot  (Figure  1).  Traditional  path-tracking  controllers  would  represent 
the robot using a bicycle model (Figure 8) with steering angle, $\delta_{\mathrm{cmd},k}$, and linear velocity, $v_{\mathrm{cmd},k}$, as control
inputs. However, in this paper, the LB-NMPC algorithm treats the Ackermann-steered robot as a unicycle
robot with linear and angular velocity commands, $v_{\mathrm{cmd},k}$ and $\omega_{\mathrm{cmd},k}$, respectively. The robot then converted

Figure 7: The first and second experiments were conducted inside the University of Toronto Institute for
Aerospace  Studies  (UTIAS)  MarsDome  on  gravel,  sand,  and  loose  dirt.  The  30-m-long  path,  shown  here, 
was  used  for  the  first  experiment.  In  all  experiments,  the  nominal  unicycle  model  used  in  our  LB-NMPC 
algorithm  included  no  prior  information  on  wheel-terrain  interactions  or  robot  dynamics. 


these velocity commands to a steering angle, $\delta_{\mathrm{cmd},k}$,

$$\delta_{\mathrm{cmd},k} = \tan^{-1}\!\left(\frac{L\,\omega_{\mathrm{cmd},k}}{v_{\mathrm{cmd},k}}\right), \qquad (32)$$


where  L  is  defined  as  the  wheelbase  of  the  Ackermann-steered  robot.  The  robot  learned  to  drive  at  a  range 
of scheduled speeds over 10 trials on a 100-m-long path. The scheduled speeds for each trial, $v_{\mathrm{sched},k}$, were
generated using the same automated speed scheduler (Section 4.2) as was used in the second experiment.

Figure 8: Here we show the relationship between steering angle of an Ackermann-steered robot, $\delta_{\mathrm{cmd},k}$, and
the linear and angular velocities, $v_{\mathrm{cmd},k}$ and $\omega_{\mathrm{cmd},k}$, respectively. In experiment 3, we used the LB-NMPC
algorithm for path-tracking on a 600 kg, Ackermann-steered mobile robot.
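
The conversion from unicycle commands to a steering angle is a one-line computation. In the sketch below, the function name, the wheelbase value, and the zero-speed check are assumptions added for illustration; the text does not say how a zero linear velocity is handled.

```python
import math


def steering_angle(v_cmd, omega_cmd, wheelbase):
    """Steering angle (rad) equivalent to the unicycle commands (v_cmd, omega_cmd)."""
    if v_cmd == 0.0:
        # Assumed behaviour: the ratio omega_cmd / v_cmd is undefined at standstill.
        raise ValueError("steering angle is undefined at zero linear velocity")
    return math.atan(wheelbase * omega_cmd / v_cmd)


# Hypothetical wheelbase of 2.0 m, chosen only to make the arithmetic easy to check.
delta = steering_angle(v_cmd=1.0, omega_cmd=0.5, wheelbase=2.0)
```

Any mismatch between this idealized conversion and the turning behaviour of the real vehicle appears to the controller as a disturbance, which the learned model can then absorb.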


The  first  and  second  experiments  were  performed  in  the  University  of  Toronto  Institute  for  Aerospace  Studies 
(UTIAS)  MarsDome  in  Toronto,  Ontario,  Canada  (Figure  7).  The  third  experiment  was  performed  at  the 
Defence  Research  and  Development  Canada  (DRDC)  Experimental  Proving  Grounds  in  Suffield,  Alberta, 
Canada.  In  all  experiments,  the  controller  described  in  Section  3  was  implemented  and  run  in  addition 
to the VT&R software on a Lenovo W530 laptop with an Intel 2.6 GHz Core i7 processor with 16 GB
of RAM. The camera in all tests was a Point Grey Bumblebee XB3 stereo camera. The resulting real-time
localization and path-tracking control signals were generated at approximately 10 Hz. As previously
mentioned,  hyperparameter  selection  is  currently  an  offline  process,  taking  up  to  5  minutes  in  the  later  trials 
of  an  experiment  when  the  system  had  accumulated  approximately  5000  experiences.  Since  GPS  was  not 
available,  the  improvement  due  to  the  LB-NMPC  algorithm  was  quantified  by  the  localization  of  the  VT&R 
algorithm. The VT&R algorithm is based on Visual Odometry and provides localization with errors less
than 4 cm/m when compared against GPS ground-truth (Stenning et al., 2013).


5.2  Tuning  Parameters 

The  performance  of  the  system  was  adjusted  using  the  NMPC  weighting  matrices  Qx  and  R,  the  experience 
management  parameters,  and  the  speed  scheduler  gains  and  thresholds.  The  weighting  matrices  for  each  test 
were  selected  in  advance  ranging  from  roughly  a  3:1  ratio  weighting  path-tracking  errors  and  control  inputs 
for  the  50  kg  Husky  to  a  1:1  ratio  for  the  600  kg  DMRV  robot.  The  increased  weighting  on  the  control  inputs 
for  the  heavier  robots  was  selected  to  ensure  controller  stability  at  higher  speeds.  Local  GP  models  were 


generated based on a sliding window of size, $c_{\mathrm{vertex}} = 5$ and $c_{\mathrm{velocity}} = 1$, where velocities were discretized by
$v_{\mathrm{bin}} = 0.25$ m/s. The maximum number of experiences per bin, $c_{\mathrm{bin}}$, was set to 4, resulting in local models
based on up to 180 experiences. Finally, the speed scheduler parameters we used are shown in Table 1.

[Table 1 body not recovered from the source; surviving entries: 30 and 1.1.]

Table  1:  Speed  scheduler  gains  and  thresholds.  The  scheduler  was  not  used  in  experiment  1. 


5.3  Experiment  1:  Learning  to  Follow  a  Path  with  a  Fixed  Speed  Schedule 

In  the  first  experiment,  the  50  kg  Husky  robot  autonomously  travelled  the  length  of  a  30-m-long  path  for 
20  trials  at  a  fixed  speed  of  0.4  m/s  resulting  in  600  m  of  travel  (Figure  7).  Figure  9  shows  plots  of  path 
characteristics,  path-tracking  errors,  and  angular  velocity  control  input  vs.  distance  along  the  path.  The 
plots  include  comparisons  between  the  first  trial  (red),  when  the  learned  model  has  no  experience  from 
which  to  draw,  and  the  20th  trial  (blue),  when  the  learned  model  has  a  significant  amount  of  experience 
from which to draw. The heading and lateral errors, $e_H$ and $e_L$, respectively, reached their peaks in trial 20
(blue)  around  14-22  m  along  the  path  where  the  path  pitched  forward,  rolled  to  the  right,  and  turned  to  the 
right.  This  section  also  corresponded  to  the  largest  changes  in  control  input  between  the  first  and  last  trial. 
In  Figure  10,  we  show  plots  of  the  maximum  and  Root  Mean  Square  (RMS)  path-tracking  errors  vs.  trial 
number.  By  disabling  the  speed  scheduler  for  experiment  1,  the  learned  model  was  allowed  to  converge.  As 
a  result,  the  LB-NMPC  algorithm  successfully  reduced  the  maximum  lateral  and  heading  errors  by  roughly 
75%  in  the  first  few  trials,  then  maintained  these  errors  for  the  next  15  trials.  However,  even  after  many 
trials,  the  maximum  and  RMS  errors  continued  to  vary.  We  suspect  that  these  changes  were  due  mainly 
to  evolving  path  conditions  (e.g.,  ruts,  dirt  piles)  and  our  experience  management  scheme,  that  handles 
computational  complexity  and  changing  disturbances  by  forgetting  experiences  over  time. 
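
For clarity, the two per-trial summary statistics reported here can be computed as follows. This is a generic sketch; the function name and the sample values are hypothetical and are not data from the experiment.

```python
import math


def max_and_rms(errors):
    """Return (maximum absolute error, root-mean-square error) for one trial."""
    max_err = max(abs(e) for e in errors)
    rms_err = math.sqrt(sum(e * e for e in errors) / len(errors))
    return max_err, rms_err


# Hypothetical signed lateral errors (m) sampled along a path.
max_err, rms_err = max_and_rms([0.03, -0.04, 0.0, 0.05])
```

The maximum is sensitive to a single difficult section of the path, whereas the RMS value summarizes tracking quality over the whole traverse, so the two can move differently from trial to trial.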

5.4  Experiment  2:  Learning  to  Follow  a  Path  at  Increasing  Speeds 

In  the  second  experiment,  the  150  kg  ROC6  robot  autonomously  travelled  the  length  of  a  60-m-long  path 
at  a  range  of  scheduled  speeds  over  20  trials  to  demonstrate  the  ability  of  the  algorithm  to  interpolate  and 
extrapolate  from  learned  experiences  (Figure  11).  The  path  for  the  second  experiment  was  mainly  on  level 
ground but included path curvatures up to 0.5 m$^{-1}$, suiting the capabilities of the ROC6 robot. Figure 12
shows  plots  of  path  characteristics,  scheduled  speeds,  and  VT&R  matched  features  vs.  distance  along  the 


Figure 9: The desired path for experiment 1 included slope angles up to 15°, side-slope angles up to 15°,
and path curvatures up to 1 m$^{-1}$ as estimated relative to the start of the path by the VT&R algorithm. We
also show the lateral and heading path-tracking errors, $e^{(j)}_{L,i}$ and $e^{(j)}_{H,i}$, and the commanded angular velocity,
$\omega^{(j)}_{\mathrm{cmd},i}$, for $j = \{1, 20\}$.


path.  The  speed  scheduler  (Section  4.2)  determined  where  along  the  path  the  system  could  tolerate  higher 
speeds  using  experience  from  previous  traversals,  thus  minimizing  the  travel  time  in  sequential  trials.  In  some 
sections of the path (e.g., at ~22 m), the system took up to 3 trials before safely increasing the scheduled
speed.  This  does  not  necessarily  mean  the  learned  model  in  these  sections  had  converged,  but  only  that  the 
path-tracking  errors,  the  matched  feature  counts,  and  the  control  inputs  were  within  the  specified  limits  for 
the  speed  scheduler  (Section  5.2).  In  general,  the  speed  schedules  resulted  in  the  robot  learning  to  drive 
the  path  faster,  increasing  speeds  from  0.35  to  1.0  m/s.  Sections  of  the  path  with  poor  lighting  and  high 
curvature,  such  as  at  10,  20,  and  40  m  along  the  path,  had  relatively  low  VT&R  matched  features.  In 
these  sections,  the  speed  scheduler  suggested  increased  speeds,  though  not  as  high  as  sections  with  good 


Figure 10: The maximum and Root-Mean-Square (RMS) path-tracking errors in experiment 1 are reduced
significantly  within  the  first  few  trials.  Since  MPC  is  an  optimal  controller  balancing  path-tracking  errors 
and  control  input,  we  do  not  expect  the  path-tracking  errors  to  be  eliminated  completely. 


lighting and low curvature, such as at 15 or 30 m along the path. Further, with learning enabled and reduced
path-tracking errors, the average number of matched features was increased from 38.33 to 55.77. Since the
VT&R localization algorithm depends on matching features between the live-view and teach-pass view, an
increase  in  matches  tends  to  result  in  an  increase  in  the  localization  reliability  for  the  vision-based  mapping 
and  localization  system.  Figure  13  shows  plots  of  the  maximum  path-tracking  errors,  Root  Mean  Square 
(RMS)  path-tracking  errors,  and  overall  travel  time  vs.  trial  number.  The  LB-NMPC  algorithm  reduced  the 
lateral  and  heading  errors  by  roughly  50%  over  the  course  of  the  20  trials  while  learning  disturbances  at 
speeds  ranging  from  0.35  to  1.0  m/s. 


Figure  11:  The  second  and  third  experiments  focused  on  the  algorithm’s  ability  to  learn  unmodelled  robot 
dynamics. Here we show the skid-steered ROC6 robot driving at 0.6 m/s with learning enabled. The white
line  shows  the  desired  trajectory  (tire  tracks),  the  red  line  shows  a  trajectory  with  learning  disabled,  while 
the  dashed  blue  line  shows  a  trajectory  with  learning  enabled  and  reduced  path-tracking  errors. 


Figure  14  shows  the  learned  model  output  vs.  distance  along  the  path  and  commanded  speed.  Even  though 
our  system  collects  discrete  measurements  of  the  underlying  disturbance  function,  it  is  able  to  continuously 
interpolate  and  extrapolate  from  the  data.  In  the  second  experiment,  the  system  had  collected  roughly  20,000 
observations  for  the  learned  model,  retaining  only  5,000  observations  after  20  trials  based  on  our  experience 
management  scheme.  Note,  the  system  was  unable  to  travel  faster  than  0.8  m/s  at  40  m  along  the  path  due 
to  the  path’s  curvature.  As  a  result,  the  system  was  not  able  to  collect  experience  above  0.8  m/s  for  this 
section of the path and the resulting modelled disturbance is close to zero with relatively high uncertainty

Figure 12: The test path for experiment 2 was mainly on level ground, but included path curvatures up to
0.5 m$^{-1}$. Here we also show the scheduled speeds, $v^{(j)}_{\mathrm{sched},i}$, for trials 1 through 20, and the VT&R matched
feature counts, $c^{(j)}_{\mathrm{feature},i}$, for trial 15.

Figure 13: Here we show the reduction in maximum and Root Mean Square (RMS) lateral and heading
path-tracking errors vs. trial. Unlike the first experiment, the scheduled speeds for each trial were adjusted
throughout the second experiment resulting in a range of travel times when tracking the loop-shaped,
60-m-long path.

Figure 14: Here we show the learned values for the heading rate disturbance (i.e., the third element of $\mathbf{g}(\cdot)$)
vs. commanded speed and distance along the path. Above 0.8 m/s, 40 m along the path (blue ellipse), there
is very little data and the model is untrustworthy (Figure 15).


(Figure  15).  Nonetheless,  Figures  13  and  14  show  that  our  LB-NMPC  algorithm  is  capable  of  effectively 
maintaining  a  learned  model  for  many  operating  conditions  simultaneously,  learning  new  disturbances  as 
required  while  maintaining  a  wealth  of  knowledge  from  previous  experience. 
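
The behaviour noted above, a prediction near zero with high uncertainty where no experience exists, is a general property of a zero-mean GP with a squared-exponential kernel. The sketch below illustrates it in one dimension (commanded speed only). Everything in it is an assumption made for illustration: the hyperparameter values, the training data, and the small hand-written linear solver, which is included only to keep the example dependency-free.

```python
import math


def kernel(a, b, length=0.1, sigma_f=1.0):
    # Squared-exponential kernel with assumed hyperparameters.
    return sigma_f ** 2 * math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    # Solve A x = b by Gaussian elimination with partial pivoting.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x


def gp_predict(xs, ys, x_star, sigma_n=0.05):
    # Posterior mean and variance of a zero-mean GP at the query point x_star.
    K = [[kernel(a, b) + (sigma_n ** 2 if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_star = [kernel(a, x_star) for a in xs]
    mean = sum(k * w for k, w in zip(k_star, solve(K, ys)))
    var = kernel(x_star, x_star) - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mean, var


# Hypothetical experience: a constant disturbance observed at speeds up to 0.8 m/s.
speeds = [0.4, 0.5, 0.6, 0.7, 0.8]
disturbances = [0.3] * 5

mean_seen, var_seen = gp_predict(speeds, disturbances, 0.6)      # inside the data
mean_unseen, var_unseen = gp_predict(speeds, disturbances, 1.5)  # far from the data
```

Inside the range of experience, the prediction follows the data with low variance. Far from it, the mean reverts to the zero prior and the variance returns to the prior variance, so the controller falls back on the nominal model there.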

5.5  Experiment  3:  Learning  to  Follow  a  Path  at  Increasing  Speeds  with  an 
Ackermann-steered  Robot 

In  the  third  experiment,  the  600  kg,  Ackermann-steered  robot  autonomously  travelled  the  length  of  a 
100-m-long  path  demonstrating  the  ability  of  the  disturbance  model  to  learn  kinematics  and  dynamics 
of  a  significantly  different  mass  and  robot  design2  (Figure  16).  Figure  17  shows  plots  of  path  characteristics, 
scheduled  speed  and  path-tracking  errors  vs.  distance  along  the  path.  Like  the  second  experiment,  the  speed 
scheduler  tried  to  minimize  the  overall  travel  time  and  determined  where  along  the  path  the  system  could 
tolerate higher speeds using experience from previous traversals. Over the course of the 10 trials, the LB-NMPC
algorithm reduced the lateral and heading errors significantly while learning disturbances at speeds
ranging  from  0.5 m/s  to  1.2 m/s  (Figure  18). 

This  last  experiment  highlighted  the  need  for  work  on  experience  management  and  controller  robustness. 
Between  85  and  90  m  along  the  path,  in  all  trials  of  experiment  3,  the  results  showed  a  sharp  change  in 
path-tracking  errors.  For  example,  in  trial  1,  the  VT&R  state  estimate  produced  a  step-change  in  lateral 
path-tracking  error  of  ~25  cm  in  a  single  time-step  (Figure  17).  In  reality  the  robot  made  no  such  movement. 
This  artificial  motion  estimate  was  triggered  by  what  Furgale  and  Barfoot  (2010)  called  a  ‘teach  pass  failure’, 
resulting  in  a  discontinuity  in  the  state  estimate  during  relocalization.  In  this  case,  the  LB-NMPC  algorithm 
treated  the  side-step  as  a  modelling  error  and  learned  to  turn  in  anticipation  of  the  (artificial)  disturbance, 
thereby  causing  subsequent  (real)  path-tracking  errors.  While  practical  state  estimation  algorithms  should 
avoid providing faulty estimates, the stakes are higher with learning algorithms that are capable of inadvertently
incorporating such outlier measurements into the learned model and then acting on incorrect data.
This  is  one  motivation  for  our  experience  management  scheme,  which  forgets  experiences  over  time. 

Figure  18  shows  plots  of  the  maximum  and  RMS  path-tracking  errors,  and  overall  travel  time  vs.  trial 
number. The LB-NMPC algorithm reduced the lateral and heading errors by more than 50% over the course
of the 10 trials while learning disturbances at speeds ranging from 0.5 to 1.2 m/s. The maximum path-tracking
errors (heading and lateral) in trials 5, 7, 9, and 10 occurred at the aforementioned section of the
path between 85 and 90 m, where the localization system indicated an (artificial) disturbance.


2  Associated  video  at  http://tiny.cc/RoverLearnsDisturbances


6  Discussion 


6.1  Controller  Robustness 

In  general,  our  LB-NMPC  algorithm  is  initialized  with  a  known  nominal  model  and  learns  the  discrepancies 
between  the  known  model  and  the  actual  robot  behavior.  Therefore  by  its  very  structure,  the  augmented 
process  model  used  by  our  NMPC  algorithm  has  varying  levels  of  uncertainty  while  learning.  Controller 
robustness, i.e., the capability of a controller to stabilize a system in spite of model uncertainty, is an open
question for learning controllers in general (Schaal et al., 2010). In this paper, we do not explicitly consider the
robustness  of  the  controller  but  focus  on  the  practical  application  of  LB-NMPC  to  mobile  robots.  However, 
having  established  the  effectiveness  of  the  LB-NMPC  algorithm  at  reducing  control  errors  with  few  a  priori 
assumptions,  our  future  work  will  focus  on  techniques  to  leverage  the  covariance  estimates  provided  by  our 
GP-based  model  in  a  robust  control  framework.  Ideally  the  controller  will  automatically  choose  between 
conservativeness,  when  the  model  is  relatively  uncertain,  and  optimality,  when  the  model  is  less  uncertain. 


Figure 15: Modelled disturbances, $\mathbf{g}(\cdot) = (g_{(1)}(\cdot), g_{(2)}(\cdot), g_{(3)}(\cdot))$, for $v_d = 0.9$ m/s. With no experience
above 0.8 m/s, 40 m along the path, the modelled disturbance is zero and relatively uncertain.


Figure  16:  In  experiment  3,  we  tested  with  a  600  kg,  Ackermann-steered  robot  repeating  a  100-m-long  path. 
Like  the  previous  experiments,  the  nominal  model  used  by  the  LB-NMPC  algorithm  was  a  unicycle  model, 
demonstrating  the  algorithm’s  ability  to  be  applied  to  robots  with  significantly  different  designs. 

6.2  Convergence  Rates 

Determination  of  convergence  rates  is  also  an  open  problem  in  model-based  learning  controllers  (Nguyen- 
Tuong  and  Peters,  2011).  Unlike  techniques  such  as  Iterative  Learning  Control  (Bristow  et  al.,  2006;  Ahn 
et  al.,  2007),  which  assume  identical  initial  conditions  and  desired  trajectories  for  all  trials  in  order  to  make 
claims  on  convergence  rates,  model-based  learning  controllers,  such  as  the  work  illustrated  in  this  paper, 
address  a  more  general  problem  trying  to  learn  with  arbitrary  initial  conditions,  paths,  and  speed  schedules. 
This enables more flexible use of the robot since it is able to learn more than one path at more than one speed.
However,  it  also  presents  essentially  a  sporadic  approach  to  learning,  in  that  it  is  not  guaranteed  when  or 
if  a  state  will  be  revisited  for  continued  learning.  Furthermore,  convergence  rates  are  complicated  by  the 
evolution  of  the  environment  caused  by  the  robot’s  activity.  For  example,  repeating  the  same  path  caused 
ruts  to  form  which  resulted  in  a  change  in  the  disturbances  affecting  the  nominal  process  model.  This  was 
also  a  motivation  in  using  only  the  most  recent  observations  (Section  4.3). 
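
The idea of retaining only the most recent observations can be sketched with a bounded first-in, first-out buffer. The data-management scheme of Section 4.3 is not reproduced here; the class name `RecentObservations`, the capacity, and the sample values are assumptions for illustration only.

```python
from collections import deque


class RecentObservations:
    """Keep only the newest `capacity` (input, disturbance) pairs."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest entry when full.
        self._buffer = deque(maxlen=capacity)

    def add(self, model_input, disturbance):
        self._buffer.append((model_input, disturbance))

    def as_lists(self):
        """Return inputs and disturbances in order, oldest first."""
        inputs = [x for x, _ in self._buffer]
        disturbances = [d for _, d in self._buffer]
        return inputs, disturbances


# Hypothetical example: the disturbance at one location drifts as ruts form,
# so older trials become less representative of the current terrain.
window = RecentObservations(capacity=3)
for trial, disturbance in enumerate([0.10, 0.09, 0.12, 0.15, 0.16]):
    window.add(trial, disturbance)

inputs, disturbances = window.as_lists()
```

After five trials, only trials 2, 3, and 4 remain in the buffer, so a model fitted to it reflects the rutted terrain rather than the original surface.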


Figure 17: The path for the third experiment formed a large loop with turns at 30, 45, and 75 m along the path (Figure 18). As in experiment 2, the robot learned to drive the path at a range of speeds, from 0.5 to 1.2 m/s, generated by the speed scheduler. At around 50 m along the path, it took three trials before path-tracking errors were reduced sufficiently such that the scheduled speed could be increased.

Figure 18: Over the course of 10 trials, the LB-NMPC algorithm reduced the lateral and heading path-tracking errors by over 50%, while simultaneously learning to drive at faster speeds around the loop-shaped, 100-m-long path. The desired speeds were provided by the automated speed scheduler (Section 4.2).

7  Conclusion

In summary, this paper presents a Learning-based Nonlinear Model Predictive Control (LB-NMPC) algorithm for a path-repeating, mobile robot negotiating large-scale, GPS-denied, outdoor environments. The goal is to reduce path-tracking errors using real-world experience instead of pre-programming accurate analytical models of wheel-terrain interaction, terrain topology, or robot dynamics. The LB-NMPC controller is based on a fixed, simple process model and a learned disturbance model. Disturbances effectively represent measured discrepancies between the given nominal model and the observed system behaviour. Disturbances are modelled as a Gaussian Process (GP) based on observations as a function of relevant variables such as the system state and input. Modelling the disturbances as a GP enables the algorithm to learn complex nonlinear model discrepancies and to generalize to novel situations. Localization for the controller is provided by an on-board, Visual Teach & Repeat mapping and navigation system. The paper also presents an experience-based speed scheduler that plans time-optimal schedules while guaranteeing low path-tracking errors and reliable localization.
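
To make the role of the GP concrete, the following minimal sketch performs standard zero-mean GP regression on a one-dimensional input and returns both the predicted mean and variance. It assumes a squared-exponential kernel with hand-picked hyperparameters and hypothetical data; it is not the paper's implementation, which models disturbances over the system state, input, and other variables.

```python
import numpy as np


def sq_exp_kernel(a, b, length_scale=0.1, signal_var=1.0):
    """Squared-exponential covariance between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return signal_var * np.exp(-0.5 * (d / length_scale) ** 2)


def gp_predict(x_train, y_train, x_query, noise_var=1e-4):
    """Posterior mean and variance of a zero-mean GP at the query inputs."""
    K = sq_exp_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = sq_exp_kernel(x_query, x_train)
    mean = K_s @ np.linalg.solve(K, y_train)
    # Prior variance minus the reduction explained by the training data.
    reduction = np.sum(K_s * np.linalg.solve(K, K_s.T).T, axis=1)
    var = np.diag(sq_exp_kernel(x_query, x_query)) - reduction
    return mean, var


# Hypothetical observations: disturbance recorded at speeds up to 0.8 m/s.
speeds = np.array([0.4, 0.5, 0.6, 0.7, 0.8])
disturbance = np.array([0.02, 0.03, 0.05, 0.08, 0.10])

# Query one speed inside the experienced range and one far outside it.
mean, var = gp_predict(speeds, disturbance, np.array([0.6, 2.0]))
```

Inside the experienced range, the predicted mean follows the data and the variance is small. Far from any observation, the mean reverts to the zero prior and the variance returns to the prior value, which is the behaviour a controller would need to account for when it leaves familiar operating conditions.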

Three experiments on three significantly different robots, including over 3 km of travel on challenging paths, demonstrated the system’s ability to handle unmodelled terrain and robot dynamics, and also to interpolate and extrapolate from learned disturbances. In the second and third experiments, the experience-based speed scheduler addressed the classic exploration vs. exploitation trade-off by balancing speed, path-tracking errors, and localization reliability. The LB-NMPC approach proved to be flexible and effective at reducing path-tracking errors and increasing the reliability of the localization system. Even beginning with only the simple unicycle model, the algorithm was seamlessly deployed to multiple platforms, where it learned to reduce vehicle- and trajectory-specific path-tracking errors using experience. However, robust stability remains a large unanswered question for state-of-the-art learning control algorithms. In this work, we only make use of the predicted mean value of disturbances. In future work, we plan to investigate robust learning control, leveraging the predicted disturbance uncertainty as well as the mean when making control decisions.


8  Appendix:  Multimedia  Description 


A  multimedia  extension  has  been  prepared  to  accompany  this  work.  The  extension  shows  the  results  from 
experiment  3,  where  the  learning-based  controller  is  tested  on  a  600  kg  Ackermann-steered  robot  and  learns 
to  reduce  vehicle-  and  trajectory-specific  path-tracking  errors.  The  extension  also  shows  the  experience-based 
speed  scheduler  generating  speed  schedules  for  each  trial.  The  video  is  available  as  a  Supporting  Information 
file  in  the  online  version  of  this  article  or  at  http://tiny.cc/RoverLearnsDisturbances. 